LTX 2 Improve encode_video by Accepting More Input Types #13057
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
sayakpaul left a comment
Thanks! Left some comments.
video_np = video.cpu().numpy()
"""
Encodes a video with audio using the PyAV library. Based on code from the original LTX-2 repo:
https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/utils/media_io.py
(nit): let's put a permalink.
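For context, the PyAV-based encoding referred to in the excerpt above follows the usual open-container / encode-frame / mux-packet pattern. Below is a minimal standalone sketch of that pattern, not the PR's implementation; the codec, frame count, resolution, and frame rate are made up.

```python
import av
import numpy as np

# Dummy uint8 RGB frames standing in for a decoded video (values are random).
frames = np.random.randint(0, 256, (24, 256, 256, 3), dtype=np.uint8)

with av.open("out.mp4", mode="w") as container:
    stream = container.add_stream("libx264", rate=24)
    stream.width, stream.height = 256, 256
    stream.pix_fmt = "yuv420p"
    for frame_array in frames:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
        for packet in stream.encode(frame):
            container.mux(packet)
    # Flush any packets still buffered in the encoder.
    for packet in stream.encode():
        container.mux(packet)
```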
        The sampling rate of the audio waveform. For LTX 2, this is typically 24000 (24 kHz).
    output_path (`str`):
        The path to save the encoded video to.
    video_chunks_number (`int`, *optional*, defaults to `1`):
The original LTX-2 code will use a video_chunks_number calculated from the video VAE tiling config, for example in two-stage inference. For the default num_frames value of 121 and the default tiling config TilingConfig.default(), I believe this works out to 3 chunks. The idea seems to be that the chunks correspond to each tiled stride when decoding.
In practice, I haven't had any issues with the current code, which is equivalent to using just one chunk. I don't fully understand why the original code supports this; my guess is that it is useful for very long videos or when there are compute constraints.
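For illustration only, a chunk count of this kind could fall out of a temporal stride roughly as in the sketch below; the function name, the stride argument, and the stride value are assumptions, not the actual TilingConfig API.

```python
import math


def estimate_video_chunks_number(num_frames: int, temporal_stride_frames: int) -> int:
    # Hypothetical rule: one chunk per temporal stride of the tiled VAE decode.
    return math.ceil(num_frames / temporal_stride_frames)


# With 121 frames and an assumed stride of 41 frames this gives 3 chunks,
# matching the figure mentioned above; the real stride value may differ.
print(estimate_video_chunks_number(121, 41))  # 3
```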
See #13057 (comment) for discussion about some complications for supporting video_chunks_number.
    video = torch.from_numpy(video)
elif isinstance(video, np.ndarray):
    # Pipeline output_type="np"
    video = (video * 255).round().astype("uint8")
Should we check for the value range before doing this? Just to be safe.
for video_chunk in tqdm(all_tiles(first_chunk, video), total=video_chunks_number):
    video_chunk_cpu = video_chunk.to("cpu").numpy()
    for frame_array in video_chunk_cpu:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
Should we let the users control this format? 👀
I think we could allow the users to specify the format, but this would be in tension with value checking as suggested in #13057 (comment): for example, if we always convert denormalized inputs with values in [0, 1] to uint8 values in [0, 255], that conversion would only be correct for some pixel formats.
We could conditionally convert based on the supplied video_format, but my understanding is that there are a lot of video formats, and I don't think we can anticipate all of the use cases that users may have. So I think we could support a video_format argument with a "use at your own risk" caveat:
elif isinstance(video, np.ndarray):
    # Pipeline output_type="np"
    is_denormalized = np.logical_and(np.zeros_like(video) <= video, video <= np.ones_like(video))
    if np.all(is_denormalized) and video_format == "rgb24":
        video = (video * 255).round().astype("uint8")
    else:
        logger.warning(
            f"The video will be encoded using the input `video` values as-is with format {video_format}. Make sure"
            " the values are in the proper range for the supplied format."
        )
    video = torch.from_numpy(video)

An alternative would be to only support "rgb24" as the original LTX-2 code does, with the idea that power users can use their own video encoding code if they have a different use case.
EDIT: the right terminology here might be "pixel format" rather than "video format".
> An alternative would be to only support "rgb24" as the original LTX-2 code does with the idea that power users can use their own video encoding code if they have a different use case.

Okay, let's go with this.
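Putting the two threads together, here is a minimal sketch of the agreed behavior, assuming a fixed "rgb24" pixel format plus the value-range check suggested earlier; the helper name is made up and the exact checks in the merged PR may differ.

```python
import numpy as np
import torch


def _np_video_to_uint8_tensor(video: np.ndarray) -> torch.Tensor:
    # Only "rgb24" is supported, so float inputs are expected to be in [0, 1].
    if video.dtype != np.uint8:
        if video.min() < 0.0 or video.max() > 1.0:
            raise ValueError("Expected float `video` values in [0, 1] for rgb24 encoding.")
        video = (video * 255).round().astype("uint8")
    return torch.from_numpy(video)
```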
    yield first_chunk
    yield from tiles_generator


for video_chunk in tqdm(all_tiles(first_chunk, video), total=video_chunks_number):
WDYT of getting rid of all_tiles() and doing it like so?
from itertools import chain

for video_chunk in tqdm(chain([first_chunk], video), total=video_chunks_number):
    video_chunk_cpu = video_chunk.to("cpu").numpy()
    for frame_array in video_chunk_cpu:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
        for packet in stream.encode(frame):
            container.mux(packet)
This does the right thing but appears not to work well with tqdm, which doesn't update properly from the chain object:
33%|████████████████████████████████▋ | 1/3 [00:04<00:09, 4.57s/it]
Actually, I think #13057 (comment) is wrong: as we generally supply a single torch.Tensor to encode_video (for example from a pipeline output), this line creates an iterator with one element:
diffusers/src/diffusers/pipelines/ltx2/export_utils.py, lines 146 to 149 in 857735f
So when we call next(video) in the following line, the video iterator is exhausted. Even if we set video_chunks_number > 1 in this case, our for loop through first_chunk and video will only yield one element in total, whether that's using all_tiles or chain. Thus, the progress bar will end up being wrong, since we tell tqdm that we have video_chunks_number > 1 elements when we in fact only have one.
I think the underlying difference is that the original LTX 2 code will return an iterator over decoded tiles when performing tiled VAE decoding, whereas we will return the whole decoded output as a single tensor with the tiles stitched back together. So maybe it doesn't make sense to support video_chunks_number as this will only work well when we supply an Iterator[torch.Tensor] to encode_video (in the current implementation).
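For illustration, the exhaustion described above can be reproduced in isolation; the shapes are made up and the `iter([...])` wrapper stands in for the referenced `export_utils` lines.

```python
from itertools import chain

import torch

decoded = torch.rand(121, 64, 64, 3)  # a single decoded video tensor, tiles already stitched
video = iter([decoded])               # stand-in for the "iterator with one element"
first_chunk = next(video)             # consumes the only element, exhausting `video`

# Whether via all_tiles() or chain(), only one chunk is ever yielded, so a
# progress bar created with total=video_chunks_number > 1 never completes.
chunks = list(chain([first_chunk], video))
print(len(chunks))  # 1
```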
What does this PR do?
This PR improves the `diffusers.pipelines.ltx2.export_utils.encode_video` function for exporting videos with audio by allowing it to accept input types besides `torch.Tensor`, such as `np.ndarray` and `List[PIL.Image.Image]`. This is meant to make it easier to use, as users will no longer have to do special processing, such as manually converting the output to a uint8 `torch.Tensor` (see the sketch after the changelist), before calling `encode_video`; the PR handles this logic inside `encode_video`. (Note that applying such processing will still work after the change, but will no longer be necessary.)

Changelist

- Support `np.ndarray` and `List[PIL.Image.Image]` `video` inputs for `encode_video`, which are assumed to be outputs from pipelines using `output_type`s of `"np"` and `"pil"`, respectively.
- … `encode_video`.
- … `encode_video` without any special processing.
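For illustration, the manual conversion that is no longer needed looks roughly like this; the array shape is made up and the conversion lines mirror the diff excerpts above.

```python
import numpy as np
import torch

# Stand-in for a pipeline output produced with output_type="np".
video = np.random.rand(121, 512, 768, 3).astype(np.float32)

# Previously required before calling encode_video:
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

# After this PR, the np.ndarray (or a List[PIL.Image.Image]) can be passed to
# encode_video directly and this conversion happens internally.
```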
Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sayakpaul
@yiyixuxu